NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

PCCL: Energy-Efficient LLM Training with Power-Aware Collective Communication

https://doi.org/10.1109/ICCD63220.2024.00023

Jia, Ziyang; Bhuyan, Laxmi N; Wong, Daniel (November 2024, IEEE)

Full Text Available
GreenMD: Energy-efficient Matrix Decomposition on Heterogeneous Multi-GPU Systems

https://doi.org/10.1145/3583590

Zamani, Hadi; Bhuyan, Laxmi; Chen, Jieyang; Chen, Zizhong (June 2023, ACM Transactions on Parallel Computing)

The current trend of performance growth in HPC systems is accompanied by a massive increase in energy consumption. In this article, we introduce GreenMD, an energy-efficient framework for heterogeneous systems for LU factorization utilizing multi-GPUs. LU factorization is a crucial kernel from the MAGMA library, which is highly optimized. Our aim is to apply DVFS to this application by leveraging slacks intelligently on both CPUs and multiple GPUs. To predict the slack times, accurate performance models are developed separately for both CPUs and GPUs based on the algorithmic knowledge and manufacturer’s specifications. Since DVFS does not reduce static energy consumption, we also develop undervolting techniques for both CPUs and GPUs. Reducing voltage below threshold values may give rise to errors; hence, we extract the minimum safe voltages ( V safeMin ) for the CPUs and GPUs utilizing a low overhead profiling phase and apply them before execution. It is shown that GreenMD improves the CPU, GPU, and total energy about 59%, 21%, and 31%, respectively, while delivering similar performance to the state-of-the-art linear algebra MAGMA library.
more » « less
Full Text Available
Improving Energy Saving of One-Sided Matrix Decompositions on CPU-GPU Heterogeneous Systems

https://doi.org/10.1145/3572848.3577496

Chen, Jieyang; Liang, Xin; Zhao, Kai; Sabzi, Hadi Zamani; Bhuyan, Laxmi; Chen, Zizhong (February 2023, ACM Principles and Practice of Parallel Programming)

Full Text Available
Synergy: A SmartNIC Accelerated 5G Dataplane and Monitor for Mobility Prediction

https://doi.org/10.1109/ICNP55882.2022.9940261

Panda, Sourav; Ramakrishnan, K. K.; Bhuyan, Laxmi N. (October 2022, 2022 IEEE 30th International Conference on Network Protocols (ICNP))

The 5G user plane function (UPF) is a critical inter-connection point between the data network and cellular network infrastructure. It governs the packet processing performance of the 5G core network. UPFs also need to be flexible to support several key control plane operations. Existing UPFs typically run on general-purpose CPUs, but have limited performance because of the overheads of host-based forwarding. We design Synergy, a novel 5G UPF running on SmartNICs that provides high throughput and low latency. It also supports monitoring functionality to gather critical data on user sessions for the prediction and optimization of handovers during user mobility. The SmartNIC UPF efficiently buffers data packets during handover and paging events by using a two-level flow-state access mechanism. This enables maintaining flow-state for a very large number of flows, thus providing very low latency for control and data planes and high throughput packet forwarding. Mobility prediction can reduce the handover delay by pre-populating state in the UPF and other core NFs. Synergy performs handover predictions based on an existing recurrent neural network model. Synergy's mobility predictor helps us achieve 2.32× lower average handover latency. Buffering in the SmartNIC, rather than the host, during paging and handover events reduces packet loss rate by at least 2.04×. Compared to previous approaches to building programmable switch-based UPFs, Synergy speeds up control plane operations such as handovers because of the low P4-programming latency leveraging tight coupling between SmartNIC and host.
more » « less
Full Text Available
Cottage: Coordinated Time Budget Assignment for Latency, Quality and Power Optimization in Web Search

https://doi.org/10.1109/HPCA53966.2022.00017

Zhou, Liang; Bhuyan, Laxmi N.; Ramakrishnan, K. K. (April 2022, IEEE International Symposium on High-Performance Computer Architecture (HPCA))

Full Text Available
pMACH: Power and Migration Aware Container scHeduling

https://doi.org/10.1109/ICNP52444.2021.9651911

Panda, Sourav; Ramakrishnan, K. K.; Bhuyan, Laxmi N. (November 2021, IEEE 29th International Conference on Network Protocols (ICNP))

Full Text Available
SmartWatch: accurate traffic analysis and flow-state tracking for intrusion prevention using SmartNICs

https://doi.org/10.1145/3485983.3494861

Panda, Sourav; Feng, Yixiao; Kulkarni, Sameer G; Ramakrishnan, K. K.; Duffield, Nick; Bhuyan, Laxmi N. (December 2021, CoNEXT '21: Proceedings of the 17th International Conference on emerging Networking EXperiments and Technologies)

Despite advances in network security, attacks targeting mission critical systems and applications remain a significant problem for network and datacenter providers. Existing telemetry platforms detect volumetric attacks at terabit scales using approximation techniques and coarse grain analysis. However, the prevalence of low and slow attacks that require very little bandwidth, makes flow-state tracking critical to overall attack mitigation. Traffic queries deployed on network switches are often limited by hardware constraints, preventing them from carrying out flow tracking features required to detect stealthy attacks. Such attacks can go undetected in the midst of high traffic volumes. We design SmartWatch, a novel flow state tracking and flow logging system at line rate, using SmartNICs to optimize performance and simultaneously detect a number of stealthy attacks. SmartWatch leverages advances in switch based network telemetry platforms to process the bulk of the traffic and only forward suspicious traffic subsets to the SmartNIC. The programmable network switches perform coarse-grained traffic analysis while the SmartNIC conducts the finer-grained analysis which involves additional processing of the packet as a 'bump-in-the-wire'. A control loop between the SmartNIC and programmable switch tunes the queries performed in the switch to direct the most appropriate traffic subset to the SmartNIC. SmartWatch's cooperative monitoring approach yields 2.39 times better detection rate compared to existing platforms deployed on programmable switches. SmartWatch can detect covert timing channels and perform website fingerprinting more efficiently compared to standalone programmable switch solutions, relieving switch memory and control-plane processor resources. Compared to host-based approaches, SmartWatch can reduce the packet processing latency by 72.32%.
more » « less
Full Text Available
Gemini: Learning to Manage CPU Power for Latency-Critical Search Engines

https://doi.org/10.1109/MICRO50266.2020.00059

Zhou, Liang; Bhuyan, Laxmi N.; Ramakrishnan, K. K. (October 2020, 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO))
null (Ed.)
Saving energy for latency-critical applications like web search can be challenging because of their strict tail latency constraints. State-of-the-art power management frameworks use Dynamic Voltage and Frequency Scaling (DVFS) and Sleep states techniques to slow down the request processing and finish the search just-in-time. However, accurately predicting the compute demand of a request can be difficult. In this paper, we present Gemini, a novel power management framework for latency- critical search engines. Gemini has two unique features to capture the per query service time variation. First, at light loads without request queuing, a two-step DVFS is used to manage the CPU power. Our two-step DVFS selects the initial CPU frequency based on the query specific service time prediction and then judiciously boosts the initial frequency at the right time to catch-up to the deadline. The determination of boosting time further relies on estimating the error in the prediction of individual query’s service time. At high loads, where there is request queuing, only the current request being executed and the critical request in the queue adopt a two-step DVFS. All the other requests in-between use the same frequency to reduce the frequency transition overhead. Second, we develop two separate neural network models, one for predicting the service time and the other for the error in the prediction. The combination of these two predictors significantly improves the power saving and tail latency results of our two-step DVFS. Gemini is implemented on the Solr search engine. Evaluations on three representative query traces show that Gemini saves 41% of the CPU power, and is better than other state-of-the-art techniques.
more » « less
Full Text Available
Swan: a two-step power management for distributed search engines

https://doi.org/10.1145/3370748.3406573

Zhou, Liang; Bhuyan, Laxmi N.; Ramakrishnan, K. K. (August 2020, Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED '20))

Full Text Available
Goldilocks: Adaptive Resource Provisioning in Containerized Data Centers

Zhou, Liang; Bhuyan, Laxmi; Ramakrishnan, K. K. (July 2019, 39th IEEE International Conference on Distributed Computing Systems (ICDCS 2019))

Power management in data centers is challenging because of fluctuating workloads and strict task completion time requirements. Recent resource provisioning systems, such as Borg and RC-Informed, pack tasks on servers to save power. However, current power optimization frameworks based on packing leave very little headroom for spikes, and the task completion times are compromised. In this paper, we design Goldilocks, a novel resource provisioning system for optimizing both power and task completion time by allocating tasks to servers in groups. Tasks hosted in containers are grouped together by running a graph partitioning algorithm. Containers communicating frequently are placed together, which improves the task completion times. We also leverage new findings on power consumption of modern- day servers to ensure that their utilizations are in a range where they are power-proportional. Both testbed implementation measurements and large-scale trace-driven simulations prove that Goldilocks outperforms all the previous works on data center power saving. Goldilocks saves power by 11.7%-26.2% depending on the workload, whereas the best of the implemented alternatives, Borg, saves 8.9%-22.8%. The energy per request for the Twitter content caching workload in Goldilocks is only 33% of RC-Informed. Finally, the best alternative in terms of task completion time, E-PVM, has 1.17-3.29 times higher task completion times than Goldilocks across different workloads.
more » « less
Full Text Available

« Prev Next »

Search for: All records